Fine-Tune Qwen2.5-3B with DPO
Fine-Tuning Qwen2.5-3B with DPO using Unsloth on the TinyStories_Preference Dataset
Supervised fine-tuning for large language models (LLMs) enables precise adaptation of a model to specific tasks or domains, significantly improving its performance by providing labeled data tailored to desired outputs. This technique is particularly beneficial in scenarios where off-the-shelf LLMs fail to meet domain-specific or task-specific requirements, such as legal document summarization or medical diagnostics, where accuracy and relevance are critical.
However, supervised fine-tuning alone may not align a model with human preferences, especially on open-ended or subjective tasks. This is where Reinforcement Learning from Human Feedback (RLHF) techniques come in handy: they align the model's outputs with user preferences by incorporating feedback on what users value most, ensuring that the LLM generates responses that are not only accurate but also consistent with human expectations and ethical considerations. These techniques are often referred to as fine-tuning with preference alignment.
This post focuses on Direct Preference Optimization (DPO), a fine-tuning technique that aligns LLMs with human preferences by directly optimizing the model's outputs based on human feedback.
Preference Alignment with DPO
DPO was introduced in the paper [Direct Preference Optimization: Your Language Model is Secretly a Reward Model](https://arxiv.org/abs/2305.18290) by R. Rafailov et al. in 2023.
Unlike Proximal Policy Optimization (PPO), a reinforcement learning algorithm that requires training a separate reward model and iterative sampling, DPO simplifies the process by using a supervised learning framework to adjust the model’s behavior according to ranked human preferences.
The principle of DPO involves collecting human preferences on model outputs and using a binary cross-entropy objective to steer the model towards producing desired responses.
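Concretely, given a prompt $x$ with a preferred response $y_w$ and a rejected response $y_l$, the DPO objective from the paper is a binary cross-entropy over log-probability ratios between the trained policy $\pi_\theta$ and a frozen reference model $\pi_{\mathrm{ref}}$, where $\sigma$ is the sigmoid function and $\beta$ controls how far the policy may drift from the reference:

$$\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\, y_w,\, y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$$

The $\beta$ here is the same beta hyperparameter we will pass to the trainer later in this post.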
This method offers several benefits: it simplifies the training process, reduces computational requirements, and potentially leads to faster and more effective alignment with human values. Additionally, it can help mitigate the risk of inheriting biases from the training data.
In the next section, we will work on a use-case where we implement DPO to fine-tune a base LLM for a story-generation task so that its outputs align with human preferences.
Use-case
In the previous post, Fine-tune Qwen2.5-3B using Lora with Unsloth, we fine-tuned the base model Qwen2.5-3B on the instruction dataset TinyStories_Instruction with a Parameter-Efficient Fine-Tuning (PEFT) technique, LoRA, to obtain a children's story generator that responds to story instructions.
Continuing this use-case, in this post we again fine-tune the base model Qwen2.5-3B, but this time with DPO. To this end, we need a preference dataset, TinyStories_Preference. Please refer to the post Create Preference Dataset for DPO Fine-Tuning for more details about this dataset.
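For orientation, each record in the preference dataset pairs a story prompt with a preferred (chosen) and a lower-quality (rejected) completion. The field names below are the ones the formatting code later in this post expects; the story text itself is an invented illustration, not an actual record:
sample = {
    "prompt": "Write a story about a brave little fox who learns to share.",
    "chosen": "Once upon a time, there was a brave little fox named Finn...",  # preferred story
    "rejected": "the fox had stuff and then the story ends.",                  # lower-quality story
}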
At the end we will compare the stories generated by the two methods.
Implementation
Install and import required packages
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps xformers trl peft accelerate bitsandbytes
!pip install -q comet_ml

from unsloth import PatchDPOTrainer
PatchDPOTrainer()
import os
import torch
from datasets import load_dataset
from transformers import TrainingArguments, TextStreamer
from unsloth import FastLanguageModel, is_bfloat16_supported
from trl import DPOConfig, DPOTrainer
from google.colab import userdata
import comet_ml
Initialize Comet ML for Experiment Tracking
comet_ml.login(project_name="dpo-lora-unsloth")
Load Pretrained Model and Tokenizer
max_seq_length = 2048
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-3B",
    max_seq_length=max_seq_length,
    load_in_4bit=False,
)
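As a side note, if GPU memory is limited, the same call can load the base model quantized to 4-bit (QLoRA-style) by flipping the load_in_4bit flag; a minimal variant of the call above:
# Optional memory-saving variant: load the base model in 4-bit
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-3B",
    max_seq_length=max_seq_length,
    load_in_4bit=True,   # trades some precision for a much smaller GPU footprint
)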
Apply LoRA Adaptation
model = FastLanguageModel.get_peft_model(
    model,
    r=32,             # LoRA rank: dimension of the low-rank update matrices
    lora_alpha=32,    # scaling factor for the adapter updates
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "up_proj", "down_proj", "o_proj", "gate_proj"],
)
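To confirm that only the LoRA adapters are trainable, you can print the trainable parameter count; print_trainable_parameters comes from the underlying PEFT wrapper, so its availability here is an assumption:
# Sanity check: only the LoRA adapter weights should be trainable
model.print_trainable_parameters()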
Dataset Preparation
Load the preference dataset, format it with an Alpaca-style prompt template, and split it into training and test sets. The Hub repo id below is an assumption based on the companion dataset post; replace it with wherever your TinyStories_Preference dataset is stored.
# Load the preference dataset (repo id assumed; adjust to your own)
dataset = load_dataset("tanquangduong/TinyStories_Preference", split="train")
alpaca_template = """Below is an instruction that describes a task.
Write a response that appropriately completes the request.
### Instruction:
{}
### Response:
"""
EOS_TOKEN = tokenizer.eos_token
def format_samples(example):
    example["prompt"] = alpaca_template.format(example["prompt"])
    example["chosen"] = example["chosen"] + EOS_TOKEN
    example["rejected"] = example["rejected"] + EOS_TOKEN
    return {
        "prompt": example["prompt"],
        "chosen": example["chosen"],
        "rejected": example["rejected"],
    }
dataset = dataset.map(format_samples)
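Before splitting, it is worth spot-checking one formatted record to confirm the template and EOS token were applied correctly:
# Inspect one formatted record
print(dataset[0]["prompt"])
print(dataset[0]["chosen"][:200])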
dataset = dataset.train_test_split(test_size=0.05)
Training Using DPOTrainer
Configure and train the model using the DPOTrainer class.
trainer = DPOTrainer(
    model=model,
    ref_model=None,   # with a PEFT model, the frozen reference is handled implicitly
    tokenizer=tokenizer,
    beta=0.5,         # the beta in the DPO loss: larger values keep the policy closer to the reference
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    max_length=max_seq_length // 2,
    max_prompt_length=max_seq_length // 2,
    args=DPOConfig(
        learning_rate=2e-6,
        lr_scheduler_type="linear",
        per_device_train_batch_size=2,
        per_device_eval_batch_size=2,
        gradient_accumulation_steps=8,   # effective batch size of 16
        num_train_epochs=1,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        optim="adamw_8bit",
        weight_decay=0.01,
        warmup_steps=10,
        output_dir="output",
        eval_strategy="steps",
        eval_steps=0.2,
        logging_steps=1,
        report_to="comet_ml",
        seed=0,
    ),
)
trainer.train()
Model Inference
Generate a response using the fine-tuned model.
FastLanguageModel.for_inference(model)
message = alpaca_template.format(
    "Write a story about a humble little bunny named Ben who follows "
    "a mysterious trail in the woods, discovering beautiful flowers, "
    "new friends, and a lovely pond along the way."
)
inputs = tokenizer([message], return_tensors="pt").to("cuda")
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer=text_streamer, max_new_tokens=2048, use_cache=True)
Save and Push to Hugging Face Hub
from huggingface_hub import login
# Log in to the Hugging Face Hub
login(token=userdata.get('HF_TOKEN'))
model.save_pretrained_merged("model", tokenizer, save_method="merged_16bit")
model.push_to_hub_merged("tanquangduong/Qwen2.5-3B-DPO-TinyStories", tokenizer, save_method="merged_16bit")
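Once pushed, the merged 16-bit checkpoint can be reloaded like any other model; a minimal sketch, reusing the Hub repo id from the push above:
# Reload the merged model for later inference
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="tanquangduong/Qwen2.5-3B-DPO-TinyStories",
    max_seq_length=2048,
    load_in_4bit=False,
)
FastLanguageModel.for_inference(model)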
Conclusion
In this guide, we demonstrated how to fine-tune Qwen2.5-3B using Direct Preference Optimization (DPO) within the Unsloth framework. By leveraging LoRA for parameter-efficient adaptation, we tailored the model's output behavior to better suit our target use-case of generating child-friendly Tiny Stories. This methodology highlights the effectiveness of combining DPO and LoRA to achieve powerful, specialized fine-tuned models.